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Abstract 

As a direct result of intense heat and aridity, deserts are thought to be among the most harsh of 
environments, particularly for their mammalian inhabitants. Given that osmoregulation can be 
challenging for these animals, with failure resulting in death, strong selection should be observed 
on genes related to the maintenance of water and solute balance. One such animal, Peromyscus 
eremicus, is native to the desert regions of the southwest United States and may live its entire 
life without oral fluid intake. As a first step toward understanding the genetics that underlie 
this phenotype, we present a characterization of the P. eremicus transcriptome. We assay four 
tissues (kidney, liver, brain, testes) from a single individual and supplement this with popu- 
lation level renal transcriptome sequencing from 15 additional animals. We identified a set of 
transcripts undergoing both purifying and balancing selection based on estimates of Tajima's 
D. In addition, we used the branch-site test to identify a transcript -Slc2a9, likely related to 
desert osmoregulation - undergoing enhanced selection in P. eremicus relative to a set of related 
non-desert rodents. 

Introduction 

Deserts are widely considered one of the harshest environments on Earth. Animals living in 
desert environments are forced to endure intense heat and drought, and as a result, species liv- 
ing in these environments are likely to posses specialized mechanisms to deal with them. While 
living in deserts likely involves a large number of adaptive traits, the ability to osmoregulate - 
to maintain the proper water and electrolyte balance - appears to be paramount [El]. Indeed, 
the maintenance of water balance is one of the most important physiologic processes for all 
organisms, whether they be desert inhabitants or not. Most animals are exquisitely sensitive 
to changes in osmolality, with slight derangement eliciting physiologic compromise. When the 
loss of water exceeds dietary intake, dehydration - and in extreme cases, death - can occur. 
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26 Thus there has likely been strong selection for mechanisms supporting optimal osmoregulation 

27 in species that live where water is limited. Understanding these mechanisms will significantly 

28 enhance our understanding of the physiologic processes underlying osmoregulation in extreme 

29 environments, which will have implications for studies of human health, conservation, and cli- 

30 mate change. 

31 

32 The genes and structures responsible for the maintenance of water and electrolyte balance 

33 are well characterized in model organisms such as mice [□], rats [EH5I], and humans [EH8I]. These 

34 studies, many of which have been enabled by newer sequencing technologies, provide a founda- 

35 tion for studies of renal genomics in non- model organisms. Because researchers have long been 

36 interested in desert adaptation, a number of studies have looked at the morphology or expression 

37 of single genes in the renal tissues of desert adapted rodents Phyllotis darwini [3] , Psammomys 

38 obesus [110], and Perognathus penicillatus [VI]. More recently, full renal transcriptomes have 

39 been generated for Dipodomys spectabilis and Chaetodipus baileyi, [O] as well as Abrothrix oli- 

40 vacea [E3]. 

41 

42 These studies provide a rich context for current and future work aimed at developing a syn- 

43 thetic understanding of the genetic and genomic underpinnings of desert adaptation in rodents. 

44 As a first step, we have sequenced, assembled, and characterized the transcriptome (using four 

45 tissue types - liver, kidney, testes and brain) of a desert adapted cricetid rodent endemic to the 

46 southwest United States [03], Peromyscus eremicus. These animals have a lifespan typical of 

47 small mammals, and therefore an individual may live its entire life without ever drinking water. 

48 These rodents have distinct advantage over other desert animals (e.g. Dipodomys) in that they 

49 breed readily in captivity, which enables future laboratory studies of the phenotype of interest. 

50 In addition, the focal species is positioned in a clade of well known animals (e.g. P. californicus, 

51 P. maniculatus, and P. polionotus) [113] with growing genetic and genomic resources [ED-US]. 

52 Together, this suggests that future comparative studies are possible. 

53 

54 While the elucidation of the mechanisms underlying adaptation to desert survival is beyond 

55 the scope of this manuscript, we aim to lay the groundwork by characterizing the transcriptome 

56 from four distinct tissues (brain, liver, kidney, testes). These data will be included in the current 

57 larger effort aimed at sequencing the entire genome. Further, via sequencing the renal tissue 

58 of a total of 15 additional animals, we characterize nucleotide polymorphism and genome-wide 

59 patterns of natural selection. Together, these investigations will aid in our overarching goal to 

60 understand the genetic basis of adaptation to deserts in P. eremicus. 
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61 Materials and Methods 

62 Animal Collection and Study Design 

63 To begin to understand how genes may underlie desert adaptation, we collected 16 individuals 

64 from a single population of P. eremicus over a two-year time period (2012-2013). These individ- 

65 uals were captured in live traps and then euthanized using isoflurane overdose and decapitation. 

66 Immediately post-mortem, the abdominal and pelvic organs were removed, cut in half (in the 

67 case of the kidneys), placed in RNAlater and flash frozen in liquid Nitrogen. Removal of the 

68 brain, with similar preservation techniques, followed. Time from euthanasia to removal of all 

69 organs never exceeded five minutes. Samples were transferred to a -80C freezer at a later date. 

70 These procedures were approved by the University of California Berkeley Animal Care and Use 

71 Committee and followed guidelines established by the American Society of Mammalogy for the 

72 use of wild animals in research [D3] . 

73 RNA extraction and Sequencing 

74 Total RNA was extracted from each tissue using a TRIzol extraction (Invitrogen) following the 

75 manufacturer's instructions. Because preparation of an RNA library suitable for sequencing is 

76 dependent on having high quality, intact RNA, a small aliquot of each total RNA extract was 

77 analyzed on a Bioanalyzer 2100 (Agilent, Palo Alto, CA, USA). Following confirmation of sample 

78 quality, the reference sequencing libraries were made using the TruSeq stranded RNA prep kit 

79 (Illumina), while an unstranded TruSeq kit was used to construct the other sequencing libraries, 
so A unique index was ligated to each sample to allow for multiplexed sequencing. Reference 
si libraries (n=4 tissue types) were then pooled to contain equimolar quantities of each individual 

82 library and submitted for Illumina sequencing using two lanes of 150nt paired end sequencing 

83 employing the rapid-mode of the HiSeq 2500 sequencer at The Hubbard Center for Genome 

84 Sciences (University of New Hampshire). The remaining 15 libraries were similarly multiplexed 

85 and sequenced in a mixture of lOOnt paired and single end sequencing runs across several lanes 

86 of an Illumina HiSeq 2000 at the Vincent G. Coates Genome Center (University of California, 

87 Berkeley) . 

ss Sequence Data Preprocessing and Assembly 

89 The raw sequence reads corresponding to the four tissue types were error corrected using the 

90 software bless [EO] using kmer=25, based on the developer's default recommendations. The 

91 error-corrected sequence reads were adapter and quality trimmed following recommendations 

92 from MacManes [EH] and Mbandi [E2]. Specifically, adapter sequence contamination and low 
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93 quality nucleotides (defined as Phred <2) were removed using the program Trimmomatic ver- 

94 sion 0.32 [2.'-i]. Reads from each tissue were assembled using the Trinity version released 17 

95 July 2014 [E31]. We used flags to indicate the stranded nature of sequencing reads and set the 

96 maximum allowable physical distance between read pairs to 999nt. The assembly was con- 

97 ducted on a linux workstation with 64 cores and 512Gb RAM. To filter the raw sequence 
™ assembly, we downloaded Mus musculus cDNA and ncRNA datasets from Ensembl (ftp: 
99 //ftp . ensembl . org/pub/release-75/f asta/mus_musculus/) and the Peromyscus manicula- 
inn tus reference transcriptome from NCBI (ftp : //ftp . ncbi . nlm . nih . gov/genomes/Peromyscus_ 

101 maniculatus_bairdii/RNA/). We used a blastN procedure (default settings, evalue set to 

102 10~ 10 ) to identify contigs in the P. eremicus dataset likely to be biological in origin. This pro- 

103 cedure, when a reference dataset is available, retains more putative transcripts that a strategy 

104 employing expression-based filtering (remove if TMP <1 [E3]) of the raw assembly. We then 

105 concatenated the filtered assemblies from each tissue into a single file and reduced redundancy 

106 using the software cd-hit-est [E3] using default setting, except that sequences were clustered 

107 based on 95% sequence similarity. This multi-fasta file was used for all subsequent analyses, 

108 including annotation and mapping. 

109 

no Assembled Sequence Annotation 

in The filtered assemblies were annotated using the default settings of the blastN algorithm [E3] 

112 against the Ensembl cDNA and ncRNA datasets described above, downloaded on 1 August 

113 2014. Among other things, the Ensemble transcript identifiers were used in the analysis of 

114 gene ontology conducted in the PANTHER package [28]. Next, because rapidly evolving nu- 

115 cleotide sequences may evade detection by blast algorithms, we used HMMER3 [ESI] to search for 

116 conserved protein domains contained in the dataset using the Pfam database [BO]. Lastly, we ex- 
1 1 / tracted putative coding sequences using Transdecoder version 4Jul2014 (http : //transdecoder . 
us sourceforge.net/) 

119 

120 To identify patterns of gene expression unique to each tissue type, we mapped sequence reads 

121 from each tissue type to the reference assembly using bwa-mem [EH] . We estimated expression 

122 for the four tissues individually using default settings of the software eXpress [B3]. Interesting 

123 patterns of expression, including instances where expression was limited to a single tissue type, 

124 were identified and visualized. 

125 
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126 Population Genomics 

127 In addition to the reference individual sequenced at four different tissue types, we sequenced 

128 15 other conspecific individuals from the same population in Palm Desert, California. Sequence 

129 data were mapped to the reference transcriptome using bwa-mem. The alignments were sorted 

130 and converted to BAM format, then passed to the program Angsd version 0.610, which was 

131 used for calculating the folded site frequency spectrum (SFS) and Tajima's D [E3]. 

132 

133 Natural Selection 

134 To characterize natural selection on several genes related to water and ion homeostasis, we iden- 

135 fined several of the transcripts identified as experiencing positive selection in a recent work on 

136 desert-adapter Dipodomys rodents. The coding sequences corresponding to these genes, Solute 

137 Carrier family 2 member 9 (Slc2a9) and the Vitamin D3 receptor (Vdr), were extracted from 

138 the dataset, aligned using the software MACSE [E31] to homologous sequences in Mus musculus, 

139 Rattus norvegicus, Peromyscus maniculatus, and Homo sapiens as identified by the conditional 

140 reciprocal best blast procedure (CRBB, [E5J]). An unrooted gene tree was constructed using the 

141 online resource Clustal- Omega, and the tree and alignment were analyzed using the branch-site 

142 model (model=2, nsSites=2, fix_omega=0 versus model=2, nsSites=2, fix_omega=l, omega=l) 

143 implemented in PAML version 4.8 [EH,EZI]. Significance was evaluated via the use of the likeli- 

144 hood ratio test. 

145 

we Results and Discussion 

147 RNA extraction, Sequencing, Assembly, Mapping 

148 RNA was extracted from the hypothalamus, renal medulla, testes, and liver from each individual 

149 using sterile technique. TRIzol extraction resulted in a large amount of high quality (RIN > 

150 8) total RNA, which was then used as input. Libraries were constructed as per the standard 

151 Illumina protocol and sequenced as described above. The number of reads per library varied 

152 from 56 million strand-specific paired-end reads in Peer360 kidney, to 9 million single-end reads 

153 in Peer321 (Table 1, available on the Short Read Archive accession XXX). Adapter sequence 

154 contamination and low-quality nucleotides were eliminated, which resulted in a loss of <2% of 

155 the total number of reads. These trimmed reads served as input for all downstream analyses. 

156 Table 1 

157 
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Dataset 


Num. Raw Reads 


Peer360 Testes 


32M PE/SS 


Peer360 Liver 


■'ill T T~v -| — I / 1" i f i 

53M PE /SS 


Peer360 Kidney 


56M PE/SS 


Peer360 Brain 


23M PE/SS 


Peer305 


19M PE 


Peer308 


15M PE 


Peer319 


-1 A TV T T "IT 

14M PE 


Peer321 


9M SE 


Peer340 


16M PE 


Peer352 


-l A TV T T "\ T — 1 

14M PE 


Peer354 


9M SE 


Peer359 


-1 A TV X T ~» 7 — \ 

14M PE 


Peer365 


16M PE 


rJiliRoDD 


1 fii\/r w 


PEER368 


14M PE 


PEER369 


14M PE 


PEER372 


17M SE 


PEER373 


23M SE 


PEER380 


16M SE 


PEER382 


14M SE 



159 Table 1. The number of sequencing reads per sample, indicated by Peer[number]. PE=paired 

160 end, SS=strand specific, SE=single end sequencing. 

161 Transcriptome assembly for each tissue type was accomplished using the program Trin- 

162 ity [E3]. The raw assembly for brain, liver, testes, and kidney contained 185425, 222096, 180233, 

163 and 514091 assembled sequences respectively. This assembly was filtered using a blastN pro- 

164 cedure against the Mus cDNA and ncRNA and P. maniculatus cDNAs, which resulted in a 

165 final dataset containing 68331 brain-specific transcripts, 71041 liver-specific transcripts, 67340 

166 testes-specific transcripts, and 113050 kidney-specific transcripts. Mapping the error-corrected 

167 adapter /quality trimmed reads to these datasets resulted in mapping 94.98% (87.01% properly 

168 paired) of the brain-derived reads to the brain transcriptome, 96.07% (88.13% properly paired) 

169 of the liver-derived reads to the liver transcriptome, 96.81% (85.10% properly paired) of the 
no testes-derived reads to the testes transcriptome, and 91.87% (83.77% properly paired) of the 

171 kidney-derived reads to the kidney transcriptome. Together, these statistics suggest that the 

172 tissue-specific transcriptomes are of extremely high quality. All tissue-specific assemblies are to 

173 be made available on Dryad. 
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174 

175 Figure 1 
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176 

177 Figure 1. The Venn Diagram. 

178 We then estimated gene expression on each of these tissue-specific datasets, which allowed 

179 us to understand expression patterns in the multiple tissues. Specifically, we constructed a Venn 

180 diagram (Figure 1) that allowed us to visualize the proportion of genes whose expression was 
lei limited to a single tissue and those whose expression was ubiquitous. 66324 transcripts are 

182 expressed on all tissue types, while 13086 are uniquely expressed in the kidney, 2222 in the 

183 testes, 3580 in the brain, and 2798 in the liver. The kidney appears to an outlier in the number 

184 of unique sequences, though this could be the result of the recovery of more lowly expressed 

185 transcripts that may be the result of deeper sequencing. 

186 

187 In addition to this, we estimated mean TMP (number of transcripts per million) for all tran- 

188 scripts. Table 2 consists of the 10 genes whose mean TMP was the highest. Several genes in this 

189 list are predominately present in a single tissue type. For instance Transcript_126459, Albumin 

190 is very highly expressed in the liver, but less so in the other tissues. It should be noted, however, 

191 that making inference based on uncorrected values for TPM is not warranted. Statistical testing 

192 for differential expression was not implemented due to the fact that no replicates are available. 

193 

194 After expression estimation, the filtered assemblies were concatenated together, and after 

195 the removal of redundancy with cd-hit-est, 123,123 putative transcripts remained (to be made 
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196 available on Genbank). From this filtered concatenated dataset, we extracted 71626 putative 

197 coding sequences (72Mb, to be made available on Dryad). Of these 71626 sequences, 38221 were 

198 complete exons (containing both start and stop codons), while the others were either truncated 

199 at the 5-prime end (20239 sequences), the 3-prime end (6445 sequences), or were internal (6721 

200 sequencing with neither stop nor start codon). The results of a Pfam search conducted on the 

201 predicted amino acid sequences will be found on Dryad. 

202 

203 Table 2 

204 



Transcript ID 


Testes 


Liver 


Kidney 


Brain 


Genbank ID 




Gene ID 


Transcript_83842 


2.05E+03 


6.40E+03 


1.03E+04 


5.47E+03 


DQ073446.1 




COX2 


Transcript_126459 


1.43E+01 


2.22E+04 


2.77E+01 


6.73E+00 


XM_006991665. 


1 


Alb 


Transcript_128937 


4.39E+00 


1.91E+04 


4.74E+02 


2.23E+00 


XM_007627625. 


1 


Apoa2 


Transcript_81233 


1.71E+03 


5.23E+03 


6.11E+03 


3.08E+03 


XM_006993867. 


1 


Fthl 


Transcript_94125 


3.67E+01 


1.08E+04 


2.09E+03 


2.75E+00 


XM_006977178. 


1 


CytP450 


Transcript.l 19945 


5.03E+03 


1.15E+03 


1.33E+03 


3.71E+03 


XM.008686011. 


1 


Ubb 


Transcript_5977 


4.95E+00 


1.01E+04 


3.05E+02 


3.58E+02 


XM_006978668. 


1 


Tf 


Transcript_4057 


2.62E+01 


9.32E+03 


1.34E+02 


8.38E+01 


XM.006994871. 


1 


Apocl 


Transcript.l 12523 


4.07E+02 


7.36E+03 


7.78E+02 


9.54E+02 


XM_006994872. 


1 


Apoe 


Transcript_98376 


1.98E+00 


8.66E+03 


1.02E+00 


2.68E+00 


XM_006970208. 


1 


Ttr 



206 Table 2. The 10 transcripts with the highest mean TPM (transcripts per million). 



207 Population Genomics 

208 As detailed above, RNAseq data from 15 individuals were mapped to the reference transcrip- 

209 tome with the resulting BAM files being used as input to the software package ANGSD. The 

210 Tajima's D statistic was calculated for all transcripts covered by at least 14 of the 15 individ- 

211 uals. In brief, a negative Tajima's D - a result of lower than expected average heterozygosity - 

212 is often associated with purifying or directional selection, recent selective sweep or population 

213 bottleneck. In contrast, a positive value for Tajima's D represents higher than expected average 
2u heterozygosity, often associated with balancing selection. 

215 

216 The distribution of the estimates of Tajima's D for all of the assembled transcripts is shown 

217 in Figure 2. The distribution is skewed toward negative values (mean=-0.89, variance=0.58), 

218 which is likely the result of purifying selection, a model of evolution commonly invoked for 

219 coding DNA sequences [ESI]. II a hie 'A presents the 10 transcripts whose estimate of Tajima's D 

220 is the greatest, while liable 41 presents the 10 transcripts whose estimate of Tajima's D is the 
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221 least. The former list of genes is likely to contain transcripts experiencing balancing selection 

222 in the studied population. This list includes, interestingly, genes obviously related to solute 

223 and water balance (e.g. Clcnkb and a transmembrane protein gene) and immune function (a 

224 interferon-inducible GTPase and a Class 1 MHC gene). The latter group, containing transcripts 

225 whose estimates of Tajima's D are the smallest are likely experiencing purifying selection. Many 

226 of these transcripts are involved in core regulatory functions where mutation may have strongly 

227 negative fitness consequences. 

228 

229 Figure 2 




i 1 1 1 1 1 1 

-3-2-10 1 2 3 
Tajima's D 



23i Figure 2. The distribution of Tajima's D for all putative transcripts. 



232 Table 3 

233 
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Transcript ID 


GenBank ID 


Description 


Tajima's 


Transcript, 


-49049 


XM 


-006533884.1 


heterogeneous nuclear ribonucleoprotein HI (Hnrnphl) 


3.26 


Transcript, 


-38378 


XM 


-006522973.1 


Son DNA binding protein (Son) 


3.19 


Transcript. 


.126187 


NM. 


.133739.2 


transmembrane protein 123 (Tmeml23) 


3.02 


Transcript. 


.70953 


XM. 


.006539066.1 


chloride channel Kb (Clcnkb) 


2.96 


Transeript. 


-37736 


XM 


-006997718.1 


h-2 class I histocompatibility antigen 


2.92 


Transcript. 


-21448 


XM 


-006986148.1 


zinc finger protein 624-like 


2.84 


Transcript. 


.47450 


NM. 


.009560.2 zinc 


finger protein 60 (Zfp60) 


2.82 


Transcript. 


.122250 


XM. 


.006539068.1 


chloride channel Kb (Clcnkb) 


2.81 


Transcript. 


-78367 


XM. 


-006496814.1 


CDC42 binding protein kinase alpha (Cdc42bpa) 


2.78 


Transcript. 


.96470 


XM. 


.006987129.1 


interferon-inducible GTPase 1-like 


2.77 



Table 3. The 10 transcripts with the highest values for Tajima's D, which suggests balancing selection. 



236 Table 4 

237 



Transcript ID 


GenBank ID 


Description 


Tajima's 


Transcript. 


-84359 


XM_006991127.1 


nuclear receptor coactivator 3 (Ncoa3) 


-2.82 


Transcript. 


-87121 


XM_006970128.1 


methyl-CpG binding domain protein 2 (Mbd2) 


-2.82 


Transcript. 


.125755 


EU053203.1 


alpha globin gene cluster 


-2.78 


Transcript. 


.87128 


XM_006976644.1 


membrane-associated ring finger (March5) 


-2.76 


Transeript. 


-55468 


XM_006978377.1 


Vpr binding protein (Vprbp) 


-2.75 


Transcript. 


.116042 


XM_006980811.1 


membrane associated guanylate kinase (Magi3) 


-2.75 


Transcript. 


.18966 


XM.006982814.1 


ubiquitin protein ligase E3 component n-recognin 5 (Ubr5) 


-2.75 


Transcript. 


.122204 


XM.008772511.1 


zinc finger protein 612 (Zfp612) 


-2.75 


Transcript. 


100550 


XM_006971297.1 


bromodomain adjacent to zinc finger domain, IB (Bazlb) 


-2.74 


Transcript. 


.33267 


XM_006975561.1 


pumilio RNA-binding family member 1 (Puml) 


-2.75 



Tabie 4. The 10 transcripts with the lowest values for Tajima's D, which suggests purifying or directional selection. 



240 Natural Selection 

241 To begin to test the hypothesis that selection on transcripts related to osmoregulation is en- 

242 hanced in the desert adapted P. eremicus, we implemented the branch-site test as described 

243 above by setting the sequence corresponding to P. eremicus for both Slc2a9 and Vdr as the 

244 foreground lineages in 2 distinct program executions. These two transcripts were chosen specifi- 

245 cally because they - the former significantly - were recently linked to osmoregulation in a desert 
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246 rodent [H3] . The test for Slc2a9 was highly significant (2ALnl=51.4, df=l, p=0), indicating en- 

247 hanced selection in P. eremicus relative to the other lineages. The branch site test for positive 

248 selection conducted on the Vdr gene was non-significant (2ALnl=0.68, df=l, p=l)- This lim- 

249 ited analysis of selection is to be followed up by an analysis of genome wide patterns of natural 

250 selection. 

251 

252 Conclusions 

253 As a direct result of intense heat and aridity, deserts are thought to be amongst the harshest 

254 environments, particularly for mammalian inhabitants. Given that osmoregulation can be chal- 

255 lenging for these animals - with failure resulting in death - strong selection should be observed on 

256 genes related to the maintenance of water and solute balance. This study aimed to characterize 

257 the transcriptome of a desert-adapted rodent species, P. eremicus. Specifically, we characterized 

258 the transcriptome of four tissue types (liver, kidney, brain, and testes) from a single individual 

259 and supplemented this with population-level renal transcriptome sequencing from 15 additional 

260 animals. We identified a set of transcripts undergoing both purifying and balancing selection 

261 based on Tajima's D. In addition, we used a branch site test to identify a transcript, likely 

262 related to desert osmoregulation, undergoing enhanced selection in P. eremicus relative to a set 

263 of non-desert rodents. 

264 
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